TweetNorm: a benchmark for lexical normalization of Spanish tweets

Authors

Iñaki Alegria

Nora Aranberri

Pere Comas

Víctor Fresno-Fernández

Pablo Gamallo

Lluís Padró

Iñaki San Vicente

Jordi Turmo

Arkaitz Zubiaga

Abstract

The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitate subsequent textual processing and, consequently, to help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.

Similar resources

Elhuyar at Tweet-Norm 2013

This paper presents the system developed by Elhuyar for the TweetNorm evaluation campaign, which consists of normalizing Spanish tweets to standard language. The normalization covers only the correction of certain out-of-vocabulary (OOV) words, previously identified by the organizers. The developed system follows a two-step strategy. First, candidates for each OOV word are generated by means of ...

Lexical Normalization of Spanish Tweets with Preprocessing Rules, Domain-specific Edit Distances, and Language Models

We present a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system's results at the SEPLN 2013 Tweet-Norm task were above average.
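
The three components named in this summary (preprocessing rules, edit-distance candidate generation, and context-based selection) can be illustrated with a minimal sketch. It is not the authors' system: the lexicon, the bigram counts, the function names, and the plain Levenshtein distance (standing in for a domain-specific one) are all assumptions chosen for illustration.

```python
import re

# Hypothetical toy resources; a real system would use a large lexicon
# and a language model trained on standard Spanish text.
LEXICON = {"que", "casa", "caso", "cosa", "mi", "bonita", "hola", "es"}
BIGRAMS = {("mi", "casa"): 5, ("mi", "caso"): 1, ("casa", "es"): 3}


def preprocess(token):
    """Lowercase and collapse runs of 3+ identical characters to one.

    The threshold of three leaves legitimate Spanish doubles (ll, rr) alone.
    """
    return re.sub(r"(.)\1{2,}", r"\1", token.lower())


def edit_distance(a, b):
    """Plain Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def candidates(token, max_dist=1):
    """Lexicon words within max_dist edits of the token."""
    return sorted(w for w in LEXICON if edit_distance(token, w) <= max_dist)


def normalize(prev_word, token):
    """Return a normalized form of token given the previous word as context."""
    token = preprocess(token)
    if token in LEXICON:
        return token
    cands = candidates(token)
    if not cands:
        return token  # leave unknown words untouched
    # Prefer the candidate with the highest bigram count after prev_word,
    # breaking ties by smaller edit distance.
    return max(cands, key=lambda w: (BIGRAMS.get((prev_word, w), 0),
                                     -edit_distance(token, w)))
```

For example, `normalize("mi", "cas")` has two candidates at distance one (`casa` and `caso`), and the toy bigram counts pick `casa` because it follows `mi` more often.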

TweetNorm: Text Normalization on Italian Twitter Data

This paper addresses the issue of text normalization on non-standard Italian data. We present TweetNorm, a system which normalizes Italian tweets in a way that the amount of microblog slang and distorted text appearance is drastically reduced and the normalized output has a much cleaner and more formal style. The paper shows that with a set of fixed language-independent rules and trained rules...

The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions

We introduce the Denoised Web Treebank: a treebank including a normalization layer and a corresponding evaluation metric for dependency parsing of noisy text, such as Tweets. This benchmark enables the evaluation of parser robustness as well as text normalization methods, including normalization as machine translation and unsupervised lexical normalization, directly on syntactic trees. Experime...

Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models

This paper explores three different methods of learning to map variant word forms (dialectal or diachronic) to standard ones from a limited parallel corpus of standard and variant texts, given that a computational description of the standard morphology is available.

Journal:

Language Resources and Evaluation

Volume 49

Publication date: 2015
